Statistical Machine Translation without Parallel Data

Author

  • Maryam Siahbani
Abstract

We examine approaches to statistical machine translation (SMT) without parallel data. SMT has achieved impressive performance by leveraging large amounts of parallel data in the source and target languages, but such data is available only for a few language pairs and domains. Using human annotation to create new parallel corpora large enough to build a good translation system is too expensive. Monolingual text, on the other hand, is plentiful for many languages. This has raised a new research challenge in SMT: how can we train a statistical machine translation system without parallel data? There has been a long line of research on learning translation from monolingual data, beginning with Rapp (1995). Much of this work has focused on extracting a translation lexicon by mining monolingual resources. More recent work has extended the goal from building translation lexicons to building full translation systems. We categorize methods of SMT without parallel data by end product (translation lexicon or translation system) and by the type of resources available (limited parallel information or only monolingual data). We characterize the methods in terms of translation quality and scalability.


Similar resources

Improving Statistical Machine Translation Performance by Training Data Selection and Optimization

A parallel corpus is an indispensable resource for translation model training in statistical machine translation (SMT). Instead of collecting more and more parallel training corpora, this paper aims to improve SMT performance by exploiting the full potential of the existing parallel corpora. Two kinds of methods are proposed: offline data optimization and online model optimization. The offline method...


Training Data in Statistical Machine Translation - the More, the Better?

Current statistical machine translation (SMT) systems are stated to be dependent on the availability of very large amounts of training data for producing the language and translation models. Unfortunately, large parallel corpora are available for a limited set of language pairs and for an even more limited set of domains. In this paper we investigate the behavior of an SMT system exposed to training dat...


Generative Models of Noisy Translations with Applications to Parallel Fragment Extraction

The development of broad domain statistical machine translation systems is gated by the availability of parallel data. A promising strategy for mitigating data scarcity is to mine parallel data from comparable corpora. Although comparable corpora seldom contain parallel sentences, they often contain parallel words or phrases. Recent fragment extraction approaches have shown that including paral...


Extracting a Parallel Corpus from Comparable Documents to Improve Translation Quality in Machine Translation Systems

Data used for training statistical machine translation methods are usually prepared from three kinds of resources: parallel, non-parallel, and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...


Catalan-English statistical machine translation without a parallel corpus

This paper presents a full experiment on large-vocabulary Catalan-English statistical machine translation without an English-Catalan parallel corpus, in the context of the debates of the European Parliament. For this, we make use of an English-Spanish European Parliament Proceedings parallel corpus and a Spanish-Catalan general newspaper parallel corpus, each of more than 30 M words. G...




Publication date: 2012